Pandas is a powerful data analysis library for Python. This page is not an exhaustive
reference to its methods; I did my best to include the most used, and the most
difficult to remember, methods I have relied on over the last few years.
I encourage anyone attempting to work through this page to first get a solid grasp of the
Python language generally. That foundation will make the Pandas methods much easier to
understand.
Here I mostly cover accessing and handling data. The Pandas library is rather large, and I
would encourage anyone reading this to also reference the documentation available for the Pandas
API here.
I use more applied methods in the data project examples pages, but understanding the methods to
access and handle data is an essential first step.
Basics of a DataFrame
Let's first go over what a dataframe is and what its parts are. A dataframe is the most common
way to display, and subsequently analyze, data stored in tabular form.
Basic dataframe parts:
Columns and rows are obvious. What may seem more complicated is the index, which technically is not
a column. The index labels data at specific positions. It works like a grid coordinate
system for locating data, much like the indices of an array.
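For a concrete picture, here is a minimal sketch (with made-up values, and assuming pandas is already imported as pd, as shown in the next section) of a dataframe and its parts:
In[]: demo_df = pd.DataFrame({'country': ['Italy', 'Greece', 'Spain'], 'value': [332, 198, 676]})
In[]: demo_df.index #the default index labels the rows 0, 1, 2
Out[]: RangeIndex(start=0, stop=3, step=1)
In[]: demo_df.columns
Out[]: Index(['country', 'value'], dtype='object')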
Many of the methods on this page, notably the indexing methods, may seem abstract.
But once you know your way around a dataframe, many of the operations used to analyze
one will be much easier to understand. Once you can access the data, you can apply the
necessary operations to it.
The examples here use dataframes that are rather small, to illustrate the methods. In practice, dataframes
usually range anywhere from hundreds or thousands to millions of data points in size.
Simple Imports and Exports
To import or export most data in pandas you can do so with the following:
Importing and exporting data, and handling warnings on import, in Pandas:
import pandas as pd
import numpy as np
df = pd.read_<FileType>(r'<filePath>')
#Read file by type:
pd.read_csv(r'filePath')
pd.read_json(r'filePath')
pd.read_excel(r'filePath')
pd.read_hdf(r'filePath')
pd.read_sql('SELECT * FROM some_table', connection) #read_sql takes a SQL query and a database connection, not a file path
pd.read_fwf(r'filePath') #for fixed-width txt docs
#On occasion you run into a utf-8 codec error;
#in most cases, adding the engine='python' argument to read_csv circumvents this error
pd.read_csv(r'filePath', engine='python') #reads through UnicodeDecodeError: 'utf-8' codec can't...
#Save/export by file type:
df.to_csv(r'filePath')
df.to_json(r'filePath')
df.to_excel(r'filePath')
df.to_hdf(r'filePath', key='some_key') #to_hdf requires a key naming the group in the HDF store
df.to_sql('table_name', connection) #to_sql takes a table name and a database connection, not a file path
#Suppress FutureWarnings:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Basic data import example:
"""
Import some data. The variable (df_0) is assigned to the dataframe.
The (r) preceding the file path tells Python to read the path as a raw string.
Without it, Python may raise an exception because it can interpret the
backslashes as escape characters.
"""
In[]: df_0 = pd.read_csv(r'path\to\file\some_file_name.csv')
"""
Look at the dataframe by calling df_0. This data is completely random and
has no significance except to demonstrate some initial methods here. I will
use this dataframe and others in subsequent examples.
"""
In[]: df_0
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
0 Italy 332 3.38 16 461 969 260 30.0 -12 True
1 Greece 198 7.00 42 770 107 229 85.0 -65 False
2 Spain 676 0.84 45 650 82 354 51.0 -56 False
3 Rome 608 2.00 76 408 84 538 98.0 -93 True
4 South Africa 923 5.00 40 686 346 352 31.0 -37 True
5 Brazil 272 6.07 90 891 487 348 NaN 94 True
6 Romania 658 5.38 60 13 695 529 45.0 25 False
7 Japan 582 5.69 1 159 807 77 9.0 99 False
8 Canada 236 7.38 39 797 633 14 65.0 72 False
9 India 6 3.92 65 719 887 612 12.0 40 True
Some highlighted imports:
'''
Here are some imports that you may use in the future.
I will use these in some data project examples.
'''
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re
import pprint
import time as tm #time.time() returns seconds since 1/1/1970 (the Unix epoch)
import datetime as dt
Display
Setting the display options can be beneficial for custom viewing of data.
Col and Row Display:
#Max rows and cols; max width and height; removes the ellipsis (...) if the data is large;
#'None' here displays all data;
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.set_option('display.max_colwidth', None)
#Display the whole df without a new line separator (\):
pd.set_option('display.expand_frame_repr', False)
#Turn off all warnings and caveats:
import warnings
warnings.filterwarnings("ignore")
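If you only need the expanded view for a single output, pandas also provides a context manager that applies options temporarily; a small sketch using df_0 from above:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df_0) #printed with the expanded display
#the previous display settings are restored once the with-block exits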
Dataframe information:
Display basic dataframe information to give you an idea of what you are
looking at. It is also extremely useful for identifying the data types of
the columns—these are very important when trying to work with that data.
Basic dataframe information methods, demonstrated on the dataframe (df_0) introduced above:
'''
Call the columns into an Index object (list-like). In some instances you may run into
data that is structured with a large number of columns. Calling the columns can help
you pull out the data needed for whatever analysis.
'''
In[]: df_0.columns
Out[]:
Index(['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7', 'col_8',
'col_9', 'col_10'],
dtype='object')
Call the head of the dataframe:
'''
Calling the first five rows of a dataframe can be useful for an initial
look into the data. If you have a dataframe with a massive number of rows and/or
columns it is impractical to view the entire dataset. Here we use a built-in
method that returns the first five rows.
'''
In[]: df_0.head()
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
0 Italy 332 3.38 16 461 969 260 30.0 -12 True
1 Greece 198 7.00 42 770 107 229 85.0 -65 False
2 Spain 676 0.84 45 650 82 354 51.0 -56 False
3 Rome 608 2.00 76 408 84 538 98.0 -93 True
4 South Africa 923 5.00 40 686 346 352 31.0 -37 True
Call the tail of the dataframe:
'''
Now we can use another built-in method to look at the last five rows.
'''
In[]: df_0.tail()
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
5 Brazil 272 6.07 90 891 487 348 NaN 94 True
6 Romania 658 5.38 60 13 695 529 45.0 25 False
7 Japan 582 5.69 1 159 807 77 9.0 99 False
8 Canada 236 7.38 39 797 633 14 65.0 72 False
9 India 6 3.92 65 719 887 612 12.0 40 True
Call the shape of the dataframe:
'''
Show the shape of the dataframe with shape.
This returns a tuple of (number of rows, number of columns).
'''
In[]: df_0.shape
Out[]: (10, 10)
Call the datatypes of each column:
'''
Now let's look at the datatypes in the dataframe with dtypes. Note that
the data type of a column may not be what it initially appears to be
when looking at the dataframe itself; col_7 here is an example. In the
dataframe it seems to hold integers, but the column is coded as an
object. In this return, object means the data in col_7 are actually
stored as strings.
'''
In[]: df_0.dtypes
Out[]:
col_1 object
col_2 int64
col_3 float64
col_4 int64
col_5 int64
col_6 int64
col_7 object
col_8 float64
col_9 int64
col_10 bool
dtype: object
Call detailed information of each column:
'''
Get more detailed information on the data with info(). When looking
at the return of info() we see that the columns also have an index (left),
followed by the column names, then the non-null count of each
column. Our dataframe is shaped (10, 10), so of all the columns only col_8
has a null data cell. Null (NaN) data represents missing data.
'''
In[]: df_0.info()
Out[]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col_1 10 non-null object
1 col_2 10 non-null int64
2 col_3 10 non-null float64
3 col_4 10 non-null int64
4 col_5 10 non-null int64
5 col_6 10 non-null int64
6 col_7 10 non-null object
7 col_8 9 non-null float64
8 col_9 10 non-null int64
9 col_10 10 non-null bool
dtypes: bool(1), float64(2), int64(5), object(2)
memory usage: 858.0+ bytes
Basic Dataframe Index Slicing With .loc and .iloc
Index slicing will be very useful for scoping in on particular data.
If we look at how a dataframe is structured by its index we can see what
are essentially grid coordinates for each cell, row, and column.
Accessing data in a dataframe lets you apply analysis methods to exactly the data you need:
for example, adding one column to another to generate a third column, applying
an operator for statistical analysis, or converting a column based on some condition. These
are only a few examples of many.
Here I will start very simple with df_0 (same dataframe as above).
'''
The basic syntax for .iloc is SomeDataframe.iloc[Row_Index, Col_Index].
'''
In[]: df_0.iloc[0,0]
Out[]: 'Italy'
In[]: df_0.iloc[9,0]
Out[]: 'India'
In[]: df_0.iloc[0,4]
Out[]: 461
#remember: 'col_7', at axis 1 index 6, holds string objects, not integers;
#I will show methods to convert data types later
In[]: df_0.iloc[0,6]
Out[]: '260'
'''
You can now assign specific data to other variable objects too.
'''
In[]: data_var_1 = df_0.iloc[0,4]
In[]: data_var_1 * 2 / 3
Out[]: 307.3333333333333
#access a single column as a series
In[]: df_0.iloc[0:,0]
Out[]:
0 Italy
1 Greece
2 Spain
3 Rome
4 South Africa
5 Brazil
6 Romania
7 Japan
8 Canada
9 India
Name: col_1, dtype: object
#access a single column as a dataframe
In[]: df_0.iloc[0:,0:1]
Out[]:
col_1
0 Italy
1 Greece
2 Spain
3 Rome
4 South Africa
5 Brazil
6 Romania
7 Japan
8 Canada
9 India
#access several columns
In[]: df_0.iloc[0:,2:5]
Out[]:
col_3 col_4 col_5
0 3.38 16 461
1 7.00 42 770
2 0.84 45 650
3 2.00 76 408
4 5.00 40 686
5 6.07 90 891
6 5.38 60 13
7 5.69 1 159
8 7.38 39 797
9 3.92 65 719
#access all columns after specified index
In[]: df_0.iloc[0:,7:]
Out[]:
col_8 col_9 col_10
0 30.0 -12 True
1 85.0 -65 False
2 51.0 -56 False
3 98.0 -93 True
4 31.0 -37 True
5 NaN 94 True
6 45.0 25 False
7 9.0 99 False
8 65.0 72 False
9 12.0 40 True
#access every nth column (here, every 2nd)
In[]: df_0.iloc[0:,0::2]
Out[]:
col_1 col_3 col_5 col_7 col_9
0 Italy 3.38 461 260 -12
1 Greece 7.00 770 229 -65
2 Spain 0.84 650 354 -56
3 Rome 2.00 408 538 -93
4 South Africa 5.00 686 352 -37
5 Brazil 6.07 891 348 94
6 Romania 5.38 13 529 25
7 Japan 5.69 159 77 99
8 Canada 7.38 797 14 72
9 India 3.92 719 612 40
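All of the examples above use .iloc, which is purely positional. Its label-based counterpart .loc takes row and column labels instead, and unlike .iloc its slices include both endpoints. A short sketch against the same df_0:
In[]: df_0.loc[0, 'col_1']
Out[]: 'Italy'
#label slices are inclusive; this returns rows 0 through 2
In[]: df_0.loc[0:2, ['col_1', 'col_3']]
Out[]:
col_1 col_3
0 Italy 3.38
1 Greece 7.00
2 Spain 0.84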
Other Slicing Methods
Here I include some methods, outside of .loc and .iloc, that I have found very useful for isolating rows or cols.
We continue with the same example dataframe, df_0:
#use the column names
In[]: new_df = df_0[['col_1', 'col_3', 'col_9']]
In[]: new_df
Out[]:
col_1 col_3 col_9
0 Italy 3.38 -12
1 Greece 7.00 -65
2 Spain 0.84 -56
3 Rome 2.00 -93
4 South Africa 5.00 -37
5 Brazil 6.07 94
6 Romania 5.38 25
7 Japan 5.69 99
8 Canada 7.38 72
9 India 3.92 40
Slice rows by row index:
'''
This does not change the data unless you re-assign to
a new object variable or add the argument inplace=True
'''
In[]: df_0.drop([0,1,2,3]) #df_0.drop([0,1,2,3], inplace=True) ; makes a permanent change
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
4 South Africa 923 5.00 40 686 346 352 31.0 -37 True
5 Brazil 272 6.07 90 891 487 348 NaN 94 True
6 Romania 658 5.38 60 13 695 529 45.0 25 False
7 Japan 582 5.69 1 159 807 77 9.0 99 False
8 Canada 236 7.38 39 797 633 14 65.0 72 False
9 India 6 3.92 65 719 887 612 12.0 40 True
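The same method drops columns if you pass column names instead; as above, the change is not permanent without inplace=True or re-assignment:
In[]: df_0.drop(columns=['col_2', 'col_6'])
#equivalent older syntax: df_0.drop(['col_2', 'col_6'], axis=1)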
Use a list as an index filter:
'''
You can generate your own list of integers to filter the index as well. Here, this
drops all the even-indexed rows.
'''
In[]: idx_every_two = [i for i in range(0,9,2)]
In[]: df_0.drop(idx_every_two)
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
1 Greece 198 7.00 42 770 107 229 85.0 -65 False
3 Rome 608 2.00 76 408 84 538 98.0 -93 True
5 Brazil 272 6.07 90 891 487 348 NaN 94 True
7 Japan 582 5.69 1 159 807 77 9.0 99 False
9 India 6 3.92 65 719 887 612 12.0 40 True
Conditional Index Slicing
Conditional index slicing can get rather complex because some of the syntax is not very intuitive.
'''
Here we select all rows where col_5 is less than 500. If the data in the column is
numeric you can implement any comparison operator to perform a conditional extraction
of data. If it is string data you can still perform operations that are valid for that data.
'''
In[]: df_0.loc[df_0['col_5'] < 500]
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
0 Italy 332 3.38 16 461 969 260 30.0 -12 True
3 Rome 608 2.00 76 408 84 538 98.0 -93 True
6 Romania 658 5.38 60 13 695 529 45.0 25 False
7 Japan 582 5.69 1 159 807 77 9.0 99 False
'''
Here is another example using col_10, selecting all rows where col_10 is True.
'''
In[]: df_0.loc[df_0['col_10'] == True]
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
0 Italy 332 3.38 16 461 969 260 30.0 -12 True
3 Rome 608 2.00 76 408 84 538 98.0 -93 True
4 South Africa 923 5.00 40 686 346 352 31.0 -37 True
5 Brazil 272 6.07 90 891 487 348 NaN 94 True
9 India 6 3.92 65 719 887 612 12.0 40 True
Slice rows based on multiple columns and data conditions:
'''
Here we will slice data where it is greater than 500 in col_2 and False in col_10
'''
In[]: df_0.loc[(df_0['col_2'] > 500) & (df_0['col_10'] == False)]
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
2 Spain 676 0.84 45 650 82 354 51.0 -56 False
6 Romania 658 5.38 60 13 695 529 45.0 25 False
7 Japan 582 5.69 1 159 807 77 9.0 99 False
Slice a dataframe based on string column values:
'''
Here I used the string method startswith(), but you can use other string methods as well.
'''
In[]: df_0.loc[df_0['col_1'].str.startswith("I")]
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
0 Italy 332 3.38 16 461 969 260 30.0 -12 True
9 India 6 3.92 65 719 887 612 12.0 40 True
Slice a dataframe based on a numeric condition and an alphabetical condition:
'''
A string that starts with 'I' in col_1 and an integer value less than 50 in col_4.
'''
In[]: df_0.loc[(df_0['col_1'].str.startswith("I")) & (df_0['col_4'] < 50)]
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
0 Italy 332 3.38 16 461 969 260 30.0 -12 True
'''
Returns rows that end with 'a' in col_1 and are
less than 100 in col_4 and less than 80 in col_9.
'''
In[]: df_0.loc[(df_0['col_1'].str.endswith("a")) & (df_0['col_4'] < 100) & (df_0['col_9'] < 80)]
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
4 South Africa 923 5.00 40 686 346 352 31.0 -37 True
6 Romania 658 5.38 60 13 695 529 45.0 25 False
8 Canada 236 7.38 39 797 633 14 65.0 72 False
9 India 6 3.92 65 719 887 612 12.0 40 True
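Another conditional filter worth knowing is isin(), which matches a column against a list of values; a quick sketch on df_0:
In[]: df_0.loc[df_0['col_1'].isin(['Japan', 'Canada'])]
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
7 Japan 582 5.69 1 159 807 77 9.0 99 False
8 Canada 236 7.38 39 797 633 14 65.0 72 False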
Applying Operators to Columns And Rows
Applying operations to columns and rows may be needed for many different reasons.
Once you know how to access the data, applying operations to it becomes very easy.
'''
Starting from the original data,
here we add 100 to each data point in columns [col_2, col_3, col_4, col_5].
'''
In[]: df_0.iloc[0:,1:5] += 100
In[]: df_0
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
0 Italy 432 103.38 116 561 969 260 30.0 -12 True
1 Greece 298 107.00 142 870 107 229 85.0 -65 False
2 Spain 776 100.84 145 750 82 354 51.0 -56 False
3 Rome 708 102.00 176 508 84 538 98.0 -93 True
4 South Africa 1023 105.00 140 786 346 352 31.0 -37 True
5 Brazil 372 106.07 190 991 487 348 NaN 94 True
6 Romania 758 105.38 160 113 695 529 45.0 25 False
7 Japan 682 105.69 101 259 807 77 9.0 99 False
8 Canada 336 107.38 139 897 633 14 65.0 72 False
9 India 106 103.92 165 819 887 612 12.0 40 True
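Beyond in-place arithmetic, you can run an arbitrary function over a column with apply(); a small sketch (the lambda here is just an example, and col_6 was untouched by the += 100 above):
In[]: df_0['col_6'].apply(lambda x: x / 10)
Out[]:
0 96.9
1 10.7
2 8.2
3 8.4
4 34.6
5 48.7
6 69.5
7 80.7
8 63.3
9 88.7
Name: col_6, dtype: float64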
Create Columns Based on Conditions Of Other Columns
Here are examples of things you may need to do to analyze data based on conditions.
Most of the time each column represents a variable of some kind. These variables can
be used to craft other calculations and variables.
Here again is our dataframe (reset to its original values, before the += 100 applied above):
In[]: df_0
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
0 Italy 332 3.38 16 461 969 260 30.0 -12 True
1 Greece 198 7.00 42 770 107 229 85.0 -65 False
2 Spain 676 0.84 45 650 82 354 51.0 -56 False
3 Rome 608 2.00 76 408 84 538 98.0 -93 True
4 South Africa 923 5.00 40 686 346 352 31.0 -37 True
5 Brazil 272 6.07 90 891 487 348 NaN 94 True
6 Romania 658 5.38 60 13 695 529 45.0 25 False
7 Japan 582 5.69 1 159 807 77 9.0 99 False
8 Canada 236 7.38 39 797 633 14 65.0 72 False
9 India 6 3.92 65 719 887 612 12.0 40 True
'''
Say we need to find all the values in col_5 that are greater than the values in col_2.
Here is how we create a new column (col_11) that displays, for each row, True if
col_5 is greater than col_2 and False otherwise.
'''
In[]: df_0['col_11'] = df_0['col_5'] > df_0['col_2']
In[]: df_0
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10 col_11
0 Italy 332 3.38 16 461 969 260 30.0 -12 True True
1 Greece 198 7.00 42 770 107 229 85.0 -65 False True
2 Spain 676 0.84 45 650 82 354 51.0 -56 False False
3 Rome 608 2.00 76 408 84 538 98.0 -93 True False
4 South Africa 923 5.00 40 686 346 352 31.0 -37 True False
5 Brazil 272 6.07 90 891 487 348 NaN 94 True True
6 Romania 658 5.38 60 13 695 529 45.0 25 False False
7 Japan 582 5.69 1 159 807 77 9.0 99 False False
8 Canada 236 7.38 39 797 633 14 65.0 72 False True
9 India 6 3.92 65 719 887 612 12.0 40 True True
'''
Or, say we need to add a column that is the sum of col_3 and col_4 divided by 100.
'''
In[]: df_0['col_12'] = (df_0['col_3'] + df_0['col_4']) / 100
In[]: df_0
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10 col_11 col_12
0 Italy 332 3.38 16 461 969 260 30.0 -12 True True 0.1938
1 Greece 198 7.00 42 770 107 229 85.0 -65 False True 0.4900
2 Spain 676 0.84 45 650 82 354 51.0 -56 False False 0.4584
3 Rome 608 2.00 76 408 84 538 98.0 -93 True False 0.7800
4 South Africa 923 5.00 40 686 346 352 31.0 -37 True False 0.4500
5 Brazil 272 6.07 90 891 487 348 NaN 94 True True 0.9607
6 Romania 658 5.38 60 13 695 529 45.0 25 False False 0.6538
7 Japan 582 5.69 1 159 807 77 9.0 99 False False 0.0669
8 Canada 236 7.38 39 797 633 14 65.0 72 False True 0.4638
9 India 6 3.92 65 719 887 612 12.0 40 True True 0.6892
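A related pattern uses np.where (numpy was imported earlier) to build a column whose values depend on a condition rather than just True/False; a sketch with an arbitrary new column name (col_13) and labels of my own choosing:
In[]: df_0['col_13'] = np.where(df_0['col_2'] > 500, 'high', 'low')
#'high' where col_2 is greater than 500, 'low' everywhere else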
Merging Dataframes
Merging dataframes is related to, but distinct from, concatenation: merge joins dataframes on keys or on the index, while concat stacks them (see the sketch at the end of this section).
'''
Here are two data frames that we can merge—df_3 and df_4.
'''
In[]: df_3
Out[]:
datetime col_1 col_2
0 2009-1-20 1 22
1 2009-2-23 2 23
2 2009-8-24 3 24
3 2010-7-24 4 25
4 2010-8-22 5 26
5 2010-7-21 6 27
6 2010-8-15 7 28
7 2011-4-22 8 29
8 2011-2-22 9 30
9 2011-8-25 10 31
In[]: df_4
Out[]:
country date var_1 var_2
0 Italy 2009-1-20 3.38 16
1 Greece 2009-2-23 7.00 42
2 Spain 2009-8-24 1.90 66
3 Rome 2010-7-24 2.00 76
4 South Africa 2010-8-22 5.00 40
5 Brazil 2010-7-21 6.07 22
6 Romania 2010-8-15 5.38 60
7 Japan 2011-4-22 5.69 1
8 Canada 2011-2-22 7.38 39
9 India 2011-8-25 3.92 65
'''
One easy way to merge df_3 and df_4 is by index.
'''
In[]: df_4.merge(df_3, left_index=True, right_index=True, how='right')
Out[]:
country date var_1 var_2 datetime col_1 col_2
0 Italy 2009-1-20 3.38 16 2009-1-20 1 22
1 Greece 2009-2-23 7.00 42 2009-2-23 2 23
2 Spain 2009-8-24 1.90 66 2009-8-24 3 24
3 Rome 2010-7-24 2.00 76 2010-7-24 4 25
4 South Africa 2010-8-22 5.00 40 2010-8-22 5 26
5 Brazil 2010-7-21 6.07 22 2010-7-21 6 27
6 Romania 2010-8-15 5.38 60 2010-8-15 7 28
7 Japan 2011-4-22 5.69 1 2011-4-22 8 29
8 Canada 2011-2-22 7.38 39 2011-2-22 9 30
9 India 2011-8-25 3.92 65 2011-8-25 10 31
'''
You might run into the rows being out of order in one of the dataframes, in which
case merging by index would mismatch them; merge on a shared column instead. Note
that the column being used to merge must have the same name in both dataframes
(otherwise pass left_on and right_on).
Like this:
'''
In[]: df_3
Out[]:
datetime col_1 col_2
0 2010-8-22 1 22
1 2009-2-23 2 23
2 2011-8-25 3 24
3 2011-2-22 4 25
4 2009-8-24 5 26
5 2009-1-20 6 27
6 2010-7-24 7 28
7 2010-7-21 8 29
8 2010-8-15 9 30
9 2011-4-22 10 31
In[]: df_4
Out[]:
country datetime var_1 var_2
0 Italy 2009-1-20 3.38 16
1 Greece 2009-2-23 7.00 42
2 Spain 2009-8-24 1.90 66
3 Rome 2010-7-24 2.00 76
4 South Africa 2010-8-22 5.00 40
5 Brazil 2010-7-21 6.07 22
6 Romania 2010-8-15 5.38 60
7 Japan 2011-4-22 5.69 1
8 Canada 2011-2-22 7.38 39
9 India 2011-8-25 3.92 65
#Merge by datetime column.
In[]: df_4.merge(df_3, on='datetime', how='right')
Out[]:
country datetime var_1 var_2 col_1 col_2
0 South Africa 2010-8-22 5.00 40 1 22
1 Greece 2009-2-23 7.00 42 2 23
2 India 2011-8-25 3.92 65 3 24
3 Canada 2011-2-22 7.38 39 4 25
4 Spain 2009-8-24 1.90 66 5 26
5 Italy 2009-1-20 3.38 16 6 27
6 Rome 2010-7-24 2.00 76 7 28
7 Brazil 2010-7-21 6.07 22 8 29
8 Romania 2010-8-15 5.38 60 9 30
9 Japan 2011-4-22 5.69 1 10 31
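Concatenation, by contrast, simply stacks dataframes without matching on a key. A minimal sketch, assuming df_a and df_b are placeholder dataframes with the same columns:
#stack df_b underneath df_a; ignore_index=True rebuilds the index from 0
pd.concat([df_a, df_b], ignore_index=True)
#axis=1 would instead place them side by side, aligned on the index
pd.concat([df_a, df_b], axis=1)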
Data Cleaning
You often will not start with a clean dataframe, especially if the data was scraped
from a website. That means you will be required to do some cleaning of the data,
without impacting its validity. One of the major goals here is to make the data
uniform: you want to be working with uniform data. So, for these examples I will
use a new dataframe that is dirty and clean it up.
Here is a dirty dataframe example:
In[]: dirty_df
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 Italy()2 332 3.38 16.0 461 969.0 260 30.0 -12
1 GrEECE(1) 198 7.00 42.0 770 107.0 229 85.0 -65
2 (Spain) 676 NaN NaN 650 82.0 354 51.0 -56
3 !.Rom e -- 2.00 76.0 408 84.0 538 98.0 -93
4 sOuth A!!frica 923 5.00 40.0 686 346.0 352 31.0 -37
5 (11)Brazil 272 6.07 NaN 891 NaN 348 NaN 94
6 *Romania 658 5.38 60.0 13 695.0 529 45.0 25
7 Japan00 582 5.69 1.0 159 807.0 77 9.0 99
8 Canada6 236 7.38 39.0 797 633.0 14 65.0 72
9 !!India 6 3.92 65.0 719 887.0 612 12.0 40
'''
Let's start with looking at the info() of the dataframe.
'''
In[]: dirty_df.info()
Out[]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col_1 10 non-null object
1 col_2 10 non-null object #one string value ('--') in col_2 makes the whole column show as object
2 col_3 9 non-null float64
3 col_4 8 non-null float64
4 col_5 10 non-null int64
5 col_6 9 non-null float64
6 col_7 10 non-null object
7 col_8 9 non-null float64
8 col_9 10 non-null int64
dtypes: float64(4), int64(2), object(3)
memory usage: 858.0+ bytes
'''
Let's start with col_1 and strip all the unwanted characters so only the country names remain.
'''
#first let's remove the symbols from the data;
#the following line uses regex to remove anything that isn't a letter, digit, or underscore
In[]: dirty_df['col_1'].replace(r'\W+', '', regex=True, inplace=True)
In[]: dirty_df
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 Italy2 332 3.38 16.0 461 969.0 260 30.0 -12
1 GrEECE1 198 7.00 42.0 770 107.0 229 85.0 -65
2 Spain 676 NaN NaN 650 82.0 354 51.0 -56
3 Rome -- 2.00 76.0 408 84.0 538 98.0 -93
4 sOuthAfrica 923 5.00 40.0 686 346.0 352 31.0 -37
5 11Brazil 272 6.07 NaN 891 NaN 348 NaN 94
6 Romania 658 5.38 60.0 13 695.0 529 45.0 25
7 Japan00 582 5.69 1.0 159 807.0 77 9.0 99
8 Canada6 236 7.38 39.0 797 633.0 14 65.0 72
9 India 6 3.92 65.0 719 887.0 612 12.0 40
#now remove all the unwanted digits with the same method
In[]: dirty_df['col_1'].replace(r'\d+', '', regex=True, inplace=True)
In[]: dirty_df
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 Italy 332 3.38 16.0 461 969.0 260 30.0 -12
1 GrEECE 198 7.00 42.0 770 107.0 229 85.0 -65
2 Spain 676 NaN NaN 650 82.0 354 51.0 -56
3 Rome -- 2.00 76.0 408 84.0 538 98.0 -93
4 sOuthAfrica 923 5.00 40.0 686 346.0 352 31.0 -37
5 Brazil 272 6.07 NaN 891 NaN 348 NaN 94
6 Romania 658 5.38 60.0 13 695.0 529 45.0 25
7 Japan 582 5.69 1.0 159 807.0 77 9.0 99
8 Canada 236 7.38 39.0 797 633.0 14 65.0 72
9 India 6 3.92 65.0 719 887.0 612 12.0 40
'''
Now let's make col_1's case uniform. Here we use upper().
'''
In[]: dirty_df['col_1'] = dirty_df['col_1'].str.upper()
In[]: dirty_df
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 ITALY 332 3.38 16.0 461 969.0 260 30.0 -12
1 GREECE 198 7.00 42.0 770 107.0 229 85.0 -65
2 SPAIN 676 NaN NaN 650 82.0 354 51.0 -56
3 ROME -- 2.00 76.0 408 84.0 538 98.0 -93
4 SOUTHAFRICA 923 5.00 40.0 686 346.0 352 31.0 -37
5 BRAZIL 272 6.07 NaN 891 NaN 348 NaN 94
6 ROMANIA 658 5.38 60.0 13 695.0 529 45.0 25
7 JAPAN 582 5.69 1.0 159 807.0 77 9.0 99
8 CANADA 236 7.38 39.0 797 633.0 14 65.0 72
9 INDIA 6 3.92 65.0 719 887.0 612 12.0 40
'''
Now convert the NaN data. This method will change all the NaN values to 0.0.
You may need to do this during an analysis because the dataframe needs to be
equally shaped, meaning, in this instance, you can't have fewer than 10 items in
a column. Filling NaN with 0 keeps the dataframe's shape intact and lets you
perform other operations, but be aware it can shift summary statistics such as
the mean, so consider whether it suits your data.
'''
In[]: dirty_df.fillna(0, inplace=True)
In[]: dirty_df
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 ITALY 332 3.38 16.0 461 969.0 260 30.0 -12
1 GREECE 198 7.00 42.0 770 107.0 229 85.0 -65
2 SPAIN 676 0.00 0.0 650 82.0 354 51.0 -56
3 ROME -- 2.00 76.0 408 84.0 538 98.0 -93
4 SOUTHAFRICA 923 5.00 40.0 686 346.0 352 31.0 -37
5 BRAZIL 272 6.07 0.0 891 0.0 348 0.0 94
6 ROMANIA 658 5.38 60.0 13 695.0 529 45.0 25
7 JAPAN 582 5.69 1.0 159 807.0 77 9.0 99
8 CANADA 236 7.38 39.0 797 633.0 14 65.0 72
9 INDIA 6 3.92 65.0 719 887.0 612 12.0 40
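An alternative to filling missing values is dropping the rows that contain them; this would be run before fillna(), and whether it is acceptable depends on your analysis:
In[]: dirty_df.dropna() #drops every row that contains at least one NaN
In[]: dirty_df.dropna(subset=['col_3']) #only drops rows where col_3 is NaN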
'''
Finally, let's replace the '--' in col_2 (which could be unreported data) with 0.
'''
In[]: dirty_df['col_2'].replace('--', '0',regex=True,inplace=True)
In[]: dirty_df
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 ITALY 332 3.38 16.0 461 969.0 260 30.0 -12
1 GREECE 198 7.00 42.0 770 107.0 229 85.0 -65
2 SPAIN 676 0.00 0.0 650 82.0 354 51.0 -56
3 ROME 0 2.00 76.0 408 84.0 538 98.0 -93
4 SOUTHAFRICA 923 5.00 40.0 686 346.0 352 31.0 -37
5 BRAZIL 272 6.07 0.0 891 0.0 348 0.0 94
6 ROMANIA 658 5.38 60.0 13 695.0 529 45.0 25
7 JAPAN 582 5.69 1.0 159 807.0 77 9.0 99
8 CANADA 236 7.38 39.0 797 633.0 14 65.0 72
9 INDIA 6 3.92 65.0 719 887.0 612 12.0 40
Converting Data:
Here is a really basic example of converting the datatypes of columns.
'''
Here I'll use a new df, assigned to the variable object df_1.
Here's a view of it.
'''
In[]: df_1
Out[]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 ITALY 332 3.38 16.0 461 969.0 260 30.0 -12
1 GREECE 198 7.00 42.0 770 107.0 229 85.0 -65
2 SPAIN 676 0.00 0.0 650 82.0 354 51.0 -56
3 ROME 0 2.00 76.0 408 84.0 538 98.0 -93
4 SOUTHAFRICA 923 5.00 40.0 686 346.0 352 31.0 -37
5 BRAZIL 272 6.07 0.0 891 0.0 348 0.0 94
6 ROMANIA 658 5.38 60.0 13 695.0 529 45.0 25
7 JAPAN 582 5.69 1.0 159 807.0 77 9.0 99
8 CANADA 236 7.38 39.0 797 633.0 14 65.0 72
9 INDIA 6 3.92 65.0 719 887.0 612 12.0 40
'''
Let's look at the data types of the columns.
'''
In[]: df_1.dtypes
Out[]:
col_1 object
col_2 object
col_3 float64
col_4 float64
col_5 int64
col_6 float64
col_7 object
col_8 float64
col_9 int64
dtype: object
'''
We can see that col_7 is a string object datatype. We need to convert
it to an integer or float type to use integer or float methods on the
data. There are a few ways to do this.
'''
In[]: df_1.col_7 = df_1.col_7.astype(int) #dot syntax method
In[]: df_1.dtypes
Out[]:
col_1 object
col_2 object
col_3 float64
col_4 float64
col_5 int64
col_6 float64
col_7 int32 #col_7 data are now integers
col_8 float64
col_9 int64
dtype: object
'''
Here's an alternative syntax; this one converts the data to floats.
'''
In[]: df_1['col_7'] = df_1['col_7'].astype(float)
In[]: df_1.dtypes
Out[]:
col_1 object
col_2 object
col_3 float64
col_4 float64
col_5 int64
col_6 float64
col_7 float64 #col_7 data are now floats
col_8 float64
col_9 int64
dtype: object
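One caveat: astype() raises an error if a column contains values it cannot convert. In that case pd.to_numeric with errors='coerce' converts what it can and turns the rest into NaN; a sketch on col_2, which is still an object column here:
In[]: df_1['col_2'] = pd.to_numeric(df_1['col_2'], errors='coerce')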
Basic Date Time Methods
I have found datetime data to be a bit tricky, but I have included here some methods
for working with it.
Here is our date time dataframe:
'''
Our datetime column here is actually string data.
'''
In[]: time_df
Out[]:
col_1 datetime col_3 col_4 col_5 col_6 col_7 col_8
0 Italy 2009-1-20 3.38 16 461 969 260 30.0
1 Greece 2009-2-23 7.00 42 770 107 229 85.0
2 Spain 2009-8-24 1.90 66 650 82 354 51.0
3 Rome 2010-7-24 2.00 76 408 84 538 98.0
4 South Africa 2010-8-22 5.00 40 686 346 352 31.0
5 Brazil 2010-7-21 6.07 22 891 33 348 55.0
6 Romania 2010-8-15 5.38 60 13 695 529 45.0
7 Japan 2011-4-22 5.69 1 159 807 77 9.0
8 Canada 2011-2-22 7.38 39 797 633 14 65.0
9 India 2011-8-25 3.92 65 719 887 61 12.0
'''
First look at the data types of the columns.
'''
In[]: time_df.dtypes
Out[]:
col_1 object
datetime object #datetime is currently string data
col_3 float64
col_4 int64
col_5 int64
col_6 int64
col_7 int64
col_8 float64
dtype: object
Convert a column to datetime data:
'''
The datetime column is currently a string object.
Convert it to a datetime64 data type.
'''
In[]: time_df.datetime = pd.to_datetime(time_df['datetime'], format='%Y-%m-%d')
In[]: time_df.dtypes
Out[]:
col_1 object
datetime datetime64[ns]
col_3 float64
col_4 int64
col_5 int64
col_6 int64
col_7 int64
col_8 float64
dtype: object
Slice the rows by specific year:
'''
Here we will slice the data by the year 2010.
'''
In[]: time_df.loc[(time_df["datetime"].dt.year == 2010)]
Out[]:
col_1 datetime col_3 col_4 col_5 col_6 col_7 col_8
3 Rome 2010-07-24 2.00 76 408 84 538 98.0
4 South Africa 2010-08-22 5.00 40 686 346 352 31.0
5 Brazil 2010-07-21 6.07 22 891 33 348 55.0
6 Romania 2010-08-15 5.38 60 13 695 529 45.0
'''
There are a few different methods to accomplish filtering rows by date.
Here we slice all the rows with dates greater than 2009-1-1 and less than 2010-8-1.
'''
#Step 1: create a date mask variable object
In[]: date_mask = (time_df['datetime'] > '2009-01-01') & (time_df['datetime'] <= '2010-8-1')
#Step 2: assign a new variable object dataframe to a .loc slice of the dataframe
In[]: new_datetime_df = time_df.loc[date_mask]
In[]: new_datetime_df
Out[]:
col_1 datetime col_3 col_4 col_5 col_6 col_7 col_8
0 Italy 2009-01-20 3.38 16 461 969 260 30.0
1 Greece 2009-02-23 7.00 42 770 107 229 85.0
2 Spain 2009-08-24 1.90 66 650 82 354 51.0
3 Rome 2010-07-24 2.00 76 408 84 538 98.0
5 Brazil 2010-07-21 6.07 22 891 33 348 55.0
Convert the date time column to year, month, and day columns:
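A minimal sketch using the .dt accessor (the new column names are my own):
In[]: time_df['year'] = time_df['datetime'].dt.year
In[]: time_df['month'] = time_df['datetime'].dt.month
In[]: time_df['day'] = time_df['datetime'].dt.day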